Maintaining a healthy and consistent diet makes for our most basic and important needs. Access to food sources, however, may be limited for many people, leading to issues of food insecurity. In 2010, 14.5% (17.2 million households) in the U.S. were food insecure, 5.4% of which had very low food security Coleman-Jensen et al. (2011).
Many are left to ask: “Where’s my next meal going to come from?”
A potential factor that plays a role in determining whether households have a good and consistent supply of nutritious food may be their accessibility to food. Food accessibility is associated with the proximity to grocery stores, food markets, and other food sources. This project looks at the scope of food accessibility for households in different counties. The project also covers the effect of food environment factors such as income, population demographics, and the availability of nutrition programs on food accessibility and food insecurity.
Data
The project uses the Food Environment Atlas Data from 9/10/2020 Economic Research Service (n.d.a). The dataset has been compiled by the US Department of Agriculture’s (USDA) Economic Research Service (ERS) to study factors that affect food choices and the accessibility of healthy foods in communities. The information in the dataset had been aggregated from various reports from the USDA, the Bureau of the Census, the U.S. Department of Commerce, and more for the years 2006 to 2019. Food accessibility population data, which is the focus of the project, were given for 2010 and 2015.
The dataset contains county-level information on food environment factors such as access to grocery stores/supermarkets/restaurants, local food sales, food prices, food assistance programs like SNAP (Supplemental Nutrition Assistance Program), National School Lunch Program, etc., socioeconomic characteristics, and some health/physical activities. State-level information on household food insecurity is also given. The dataset contained data on 3143 counties, each uniquely identified with their FIPS code.
To enable geographic visualizations and analysis, an additional file from the ArcGIS Hub containing geographic boundary shapes is used.
Preprocessing
The information in the dataset was stored in separate worksheets, grouped according to the category of the features. Since there was a large number of features, the first step was to perform feature reduction. This was done by manually selecting the features that were related to the project’s questions and goals. Features that contained very granular information were also omitted in order to simplify analysis as well as to obtain more generalized insights.
The dataset and boundary shape file was also compared to check if county FIPS codes and names matched for all observations. Some county names and FIPS codes had changed over the years, which caused discrepancies between the two datasets. These counties were identified and modified to reflect the changes and match all observations. This brought the number of observations from 3143 to 3142 counties.
For columns with a low number of missing values, imputation was done using the median of those features. This was done while looking at variable distributions to ensure that imputation does not significantly change their distributions. However, median imputation was not appropriate for some 2015 variables that had a larger number of missing values. For these variables, missing values were replaced with numbers from the corresponding feature for 2010.
Overall, the data preprocessing involved merging data, selecting relevant variables and imputing missing values while exploring the nature of the variables. The cleaned dataset was then saved to a new file that is used for analysis.
Visualizations
Using the cleaned dataset, some variables of interest were explored by producing visualizations.
Food Accessibility
To gauge the level of food accessibility in counties, the dataset provides the percentage of the population with low food accessibility. For this dataset, low accessibility populations are considered to be those “living more than 1 mile from a supermarket or large grocery store if in an urban area, or more than 10 miles from a supermarket or large grocery store if in a rural area” Economic Research Service (n.d.b).
Code
import warningswarnings.filterwarnings("ignore")import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport geopandas as gpdimport plotly.express as pximport plotly.io as piofrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import Ridgefrom sklearn.linear_model import Ridge, RidgeCVfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import mean_squared_errorfrom sklearn.tree import DecisionTreeRegressor
Figure 1: Distribution of County % Population with Low Food Accessibility
The distribution of the variable reflects an interesting characteristic of food accessibility in the U.S. The two distinct modes of distribution suggest that counties can fall under two distinct groups based on the extent of food accessibility. For the majority of the counties, <40% of the population has low food accessibility, while for around 130 counties, the entire population (reaching 100%) has low accessibility to food sources.
Figure 2: Low Food Accessibility Population Percentages in 2010
The choropleth shows that low food accessibility counties are concentrated in the Midwest, Southwest, and West regions of the country. Since food accessibility is defined based on the proximity of households to food stores, the location of the low-access counties could be the product of geographic conditions and living conditions in remote regions.
The dataset measures food accessibility based on the proximity of residential areas to food stores. However, it is important to note that the availability of vehicles can also play a significant role in determining food accessibility. While some populations may live far from stores, having a car or access to transportation can help cover greater distances to purchase food.
Code
def plt_dis(c): f = sns.displot(data=df, x=c, y='PCT_LACCESS_POP10', height=4, aspect=10/8.27, bins=10) plt.xlim([-5, 100]) plt.ylabel("Low Food Access Pop. %", alpha=0.9) plt.xlabel("Household % with Low Food Access & No Vehicle", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_HHNV10')
Figure 4: Low Food Access Population and Household Vehicle Availability
To take account of this additional variable, we look at the density plot of household vehicle availability and county food accessibility. The blocks farther away from the axes represent the areas where both food accessibility and vehicle availability for the populations living there are low. These counties are likely to have the biggest challenges of food accessibility.
In other counties, although food accessibility is bad, most households have access to vehicles (shown by the few dark areas). For most counties, 0-40% have low food accesibility, but only <10% of the households with low food access do not have vehicles. Additionally, the lower percentage of households having vehicles is seen for some of the counties where larger proportion of populations did not have good food access. This could mean that the problem of food accessibility in these counties may be lower than depicted earlier.
Income and Food Access
Income is another crucial factor that can impact access to food. Low-income groups may be more sensitive to food prices and the lack of food availability in close proximity to their homes. The scatter plot charts the population with low food accessibility and populations with both low income and low food access. Here, low income is defined as “annual family income of less than or equal to 200 percent of the Federal poverty threshold based on family size” Economic Research Service (n.d.b).
Code
sns.set(color_codes=True)f, ax = plt.subplots(figsize=(6, 5))sns.scatterplot(data=df, x='PCT_LACCESS_POP10', y='PCT_LACCESS_LOWI10')ax.set_ylabel("County Population % with Low Access and Low Income", size =12, alpha=0.9)ax.set_xlabel("County Population % with Low Food Access",size =12, alpha=0.9)ax.tick_params(axis='both', which='major', labelsize=14)plt.xlim([0, 102])xpoints = ypoints = ax.get_xlim()ax.plot(xpoints, ypoints, linestyle='--', color='cornflowerblue', lw=1, scalex=False, scaley=False)f.show()
Figure 5: Low Food Access Population and Household Vehicle Availability
The plot shows that the data points always fall below the identity function line. So, the proportion of population with both low food access and low incom is less than the population percentage with low food access. This suggests that not whole of the population with low food access have low income (as defined for the datset). However, the plot also shows that counties with worse food access have greater percentages of people with low access and low income. Thus, income can be correlated with food accessibility. Furthermore, the variation in these percentage values increases as accessibility worsens. This is also reminiscent of the variation in household availability for counties with bad food access.
Demographic Characteristics
In counties where >80% of the population has low food accessibility, the majority of the population is white, followed by Hispanic and American Indian / Alaska Native.
Code
from IPython.display import Markdownfrom tabulate import tabulate# Create a separate dataframe containing county info and race categoriesdf_race = pd.DataFrame(df, columns=['FIPS', 'State', 'County', 'PCT_LACCESS_POP10','PCT_NHWHITE10','PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10','PCT_NHNA10','PCT_NHPI10', 'PCT_LACCESS_POP15', 'PCT_LACCESS_WHITE15','PCT_LACCESS_BLACK15', 'PCT_LACCESS_HISP15','PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15','PCT_LACCESS_MULTIR15'])# Store avg pct values for each race here, create table using this dictionarytable = []# all race categories as given by the datasetrace_catg = ["WHITE", "BLACK", "HISPANIC", "ASIAN", "AMER INDIAN / ALASKA NATIVE", "HAWAIIAN / PACIFIC ISLANDER", "MULTIRACIAL"]# 2010 very low access counties low_access10 = df_race[df['PCT_LACCESS_POP10']>80]# Update dictionary with 2010 mean values for very low access countiesfor (mean_pct,catg) inzip(low_access10.mean()[1:7],race_catg): table.append([catg, round(mean_pct, 2)])# 2015 low_access15 = df_race[df['PCT_LACCESS_POP15']>80]for (mean_pct,i) inzip(low_access15.mean()[8:],range(len(race_catg))): mean_pct =round(mean_pct, 2)try: table[i].append(mean_pct)exceptIndexError: table[-1].append(0) table[-1].append(mean_pct) # accounting for the extra MULTIRACIAL catg for 2015Markdown(tabulate(table, headers=["Race","2010", "2015"]))
Race
2010
2015
WHITE
84.93
85.05
BLACK
0.6
0.91
HISPANIC
9.26
9.92
ASIAN
0.3
0.34
AMER INDIAN / ALASKA NATIVE
3.63
5.61
HAWAIIAN / PACIFIC ISLANDER
0.03
0.06
Features’ Relationships
Next, we investigate the dependence of some variables in the dataset. First, we look at how food accessibility variables affect food insecurity feature given in the dataset. Second, we look at the presence of food assistance program compared with food accessibility levels.
Food Insecurity and Accessibility
The dataset provides state-level food insecurity information. Food-insecure households are those that “were unable, at times during the year, to provide adequate food for one or more household members because the household lacked money and other resources for food” Economic Research Service (n.d.b). The values measuring food insecurity were given as three-year averages. In the following model, the food insecurity percent average for 2012-2014 is used as the response variable for the training set whereas the food insecurity percent average for 2015-2017 is used as the response variable for the test set. The data frame used here was created by grouping the rows by state and taking the average. Thus, the data frame contains 51 rows with state-level information.
The linear regression returns very low R2 scores suggesting that the food- accessibility variables may not entirely represent the level of food insecurity in states.
Food Assistance Programs
The dataset gives several variables related to food assistance programs. Here, we look at some of these variables to see if these programs are available in areas with low food accessibility. The input variables here are Students eligible for free lunch (%), Students eligible for reduced-price lunch (%), SNAP participants (% pop), SNAP benefits per capita (% change), National School Lunch Program participants (%), School Breakfast Program participants (% children), Summer Food Service Program participants (% children), FDPIR (Food Distribution Program on Indian Reservations) sites. Regression models are created separately for 2010 and 2015 to see how the availability of food assistance programs and food accessibility are related for both years. Following the previous analysis, the same state-level data frame is used for the regression models.
For both years, the r2 scores are similar and less than 0.5. This suggests that the prevalence of food assistance programs is only partly correlated with food accessibility. Although food assistance programs exist to some extent in states with poor food accessibility, this may not be the case for all states.
Change in Food Accessibility (2010-15)
Lastly, regression models are used to analyze what factors can affect food accessibility in U.S. counties.
Note: The income and household vehicle availability variables are removed here since they are largely correlated with the food accessibility response variables.
Code
# create 2010 and 2015 dataframes for all countiesdf10 = pd.DataFrame(df, columns = df10_cols)df15 = pd.DataFrame(df, columns = df15_cols)# Rename columnsdf10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)# Drop columnsdf10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# Create dataframe with differences between 2010 & 2015 values as columnsdiff = pd.DataFrame()for c in df10.columns: diff[c] = df15[c]-df10[c]# Scale all columnsscaler = StandardScaler()diff = pd.DataFrame(scaler.fit_transform(diff), columns=diff.columns)# Drop correlating variables and create train/test dataX = diff.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = diff['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Apply linear regressionX = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)
The model used here does not produce good R2 scores. One reason for this may be that the counties for which there are significant changes are low in number and may seem like unusual values for the dataset. This is shown by the distribution plot shown in Figure 6.
Code
# Plot distribution of difference in low food access pop% def plt_dis(c): f = sns.displot(data=diff, x=c, height=4, aspect=10/8.27, bins=20) plt.ylabel("Number of Counties", alpha=0.8) plt.xlabel("Change in Low Food Access Pop. % (standardized)", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_POP')
Figure 6: Distribution of Change in Low Access Population (Standardized)
Now, we create a model using data for only the counties that have significant changes in food accessibility. For this, the dataset was filtered to contain only counties for which the standardized Change in Food Accessibility was greater than 1.5 and less than -1.5.
Code
# filter data for extreme casesdf_extreme = diff[diff['PCT_LACCESS_POP'] >1.5]df_extreme = diff[diff['PCT_LACCESS_POP'] <-1.5]# Drop correlated variables and create train/test dataX = df_extreme.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = df_extreme['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Linear regressionprint("Linear Regression")X = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)# Decision Treeprint("Decision Tree Regression")dt_model = DecisionTreeRegressor(random_state=42)dt_model.fit(X_train, y_train)dt_pred = dt_model.predict(X_test)dt_score = dt_model.score(X_test, y_test)print("Decision Tree score:", dt_score)
The R2 score for linear regression improved significantly for this data frame. The R2 further improves when using a Decision Tree Regrressor.
Conlusion
Coleman-Jensen, Alisha, Mark Nord, Margaret Andrews, and Steven Carlson. 2011. “Household Food Security in the United States in 2010.”ERR-125, U.S. Dept. Of Agriculture, Econ. Res. Serv.
Economic Research Service, U. S. Department of Agriculture (USDA). n.d.a. “Food Environment Atlas.”
---title: "Food Accessibility in U.S. Counties"author: "Tenzin Sherpa"bibliography: references.bibnumber-sections: falseformat: html: theme: flatly rendering: default toc: true toc-title: Contents code-fold: true code-tools: true font-family: Georgia, serif; pdf: defaultjupyter: python3---# The IssueMaintaining a healthy and consistent diet makes for our most basic and important needs. Access to food sources, however, may be limited for many people, leading to issues of food insecurity. In 2010, 14.5% (17.2 million households) in the U.S. were food insecure, 5.4% of which had very low food security @coleman. <pstyle="text-align: center;">Many are left to ask: *“Where’s my next meal going to come from?”*</p>A potential factor that plays a role in determining whether households have a good and consistent supply of nutritious food may be their accessibility to food. Food accessibility is associated with the proximity to grocery stores, food markets, and other food sources. This project looks at the scope of food accessibility for households in different counties. The project also covers the effect of food environment factors such as income, population demographics, and the availability of nutrition programs on food accessibility and food insecurity. # Data The project uses the Food Environment Atlas Data from 9/10/2020 @dataset. The dataset has been compiled by the US Department of Agriculture's (USDA) Economic Research Service (ERS) to study factors that affect food choices and the accessibility of healthy foods in communities. The information in the dataset had been aggregated from various reports from the USDA, the Bureau of the Census, the U.S. Department of Commerce, and more for the years 2006 to 2019. Food accessibility population data, which is the focus of the project, were given for 2010 and 2015. The dataset contains county-level information on food environment factors such as access to grocery stores/supermarkets/restaurants, local food sales, food prices, food assistance programs like SNAP (Supplemental Nutrition Assistance Program), National School Lunch Program, etc., socioeconomic characteristics, and some health/physical activities. State-level information on household food insecurity is also given. The dataset contained data on 3143 counties, each uniquely identified with their FIPS code.To enable geographic visualizations and analysis, an additional file from the ArcGIS Hub containing geographic boundary shapes is used.## PreprocessingThe information in the dataset was stored in separate worksheets, grouped according to the category of the features. Since there was a large number of features, the first step was to perform feature reduction. This was done by manually selecting the features that were related to the project's questions and goals. Features that contained very granular information were also omitted in order to simplify analysis as well as to obtain more generalized insights.The dataset and boundary shape file was also compared to check if county FIPS codes and names matched for all observations. Some county names and FIPS codes had changed over the years, which caused discrepancies between the two datasets. These counties were identified and modified to reflect the changes and match all observations. This brought the number of observations from 3143 to 3142 counties. For columns with a low number of missing values, imputation was done using the median of those features. This was done while looking at variable distributions to ensure that imputation does not significantly change their distributions. However, median imputation was not appropriate for some 2015 variables that had a larger number of missing values. For these variables, missing values were replaced with numbers from the corresponding feature for 2010. Overall, the data preprocessing involved merging data, selecting relevant variables and imputing missing values while exploring the nature of the variables. The cleaned dataset was then saved to a new file that is used for analysis.# VisualizationsUsing the cleaned dataset, some variables of interest were explored by producing visualizations. ### Food Accessibility To gauge the level of food accessibility in counties, the dataset provides the percentage of the population with low food accessibility. For this dataset, low accessibility populations are considered to be those "living more than 1 mile from a supermarket or large grocery store if in an urban area, or more than 10 miles from a supermarket or large grocery store if in a rural area" @documentation. ```{python}import warningswarnings.filterwarnings("ignore")import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport geopandas as gpdimport plotly.express as pximport plotly.io as piofrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import Ridgefrom sklearn.linear_model import Ridge, RidgeCVfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import mean_squared_errorfrom sklearn.tree import DecisionTreeRegressor``````{python}#| label: fig-accessdist10#| fig-cap: "Distribution of County % Population with Low Food Accessibility"df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')sns.set(style='darkgrid')def plt_dis(c): f = sns.displot(data=df, x=c, height=4, aspect=10/8.27, bins=20) plt.xlim([0, 100]) plt.ylabel("Number of Counties", alpha=0.8) plt.xlabel("Low Food Access Pop. %", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_POP10')```The distribution of the variable reflects an interesting characteristic of food accessibility in the U.S. The two distinct modes of distribution suggest that counties can fall under two distinct groups based on the extent of food accessibility. For the majority of the counties, <40% of the population has low food accessibility, while for around 130 counties, the entire population (reaching 100%) has low accessibility to food sources. **Geographic Visualization**```{python}#| label: fig-map10#| fig-cap: "Low Food Accessibility Population Percentages in 2010"df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')import jsonwithopen(r'./../src/USA_Counties_(Generalized).geojson') as f: tract = json.load(f)def plot_chlpth(df, x, label, header): fig = px.choropleth(df, locations='FIPS', color=x, color_continuous_scale="Viridis", range_color=(0, 100), geojson=tract, featureidkey ="properties.FIPS", scope="usa", hover_data=['State', 'County'], labels=label, title=header) fig.update_layout(legend=dict(orientation="h", yanchor="top",y=0.5,xanchor="right",x=1)) fig.show()plot_chlpth(df, 'PCT_LACCESS_POP10', {'PCT_LACCESS_POP10':'Popn. %'}, None)```The choropleth shows that low food accessibility counties are concentrated in the Midwest, Southwest, and West regions of the country. Since food accessibility is defined based on the proximity of households to food stores, the location of the low-access counties could be the product of geographic conditions and living conditions in remote regions. ```{python}#| label: fig-mapChange#| fig-cap: "Change in Low Food Access Pop. %"df_diff = pd.DataFrame(df, columns = ['FIPS', 'State', 'County', 'PCT_LACCESS_POP10', 'PCT_LACCESS_POP15'])df_diff['diff'] = df['PCT_LACCESS_POP15'] - df['PCT_LACCESS_POP10']fig = px.choropleth(df_diff, locations='FIPS', color='diff', color_continuous_scale="RdBu_r", range_color=(df_diff['diff'].min(), df_diff['diff'].max()), geojson=tract, featureidkey ="properties.FIPS", scope="usa", hover_data=['State', 'County'], labels={'diff':'Change in %'})fig.update_layout(legend=dict(orientation="h", yanchor="top",y=0.5,xanchor="right",x=1))fig.show()```### Food and Vehicle AccessThe dataset measures food accessibility based on the proximity of residential areas to food stores. However, it is important to note that the availability of vehicles can also play a significant role in determining food accessibility. While some populations may live far from stores, having a car or access to transportation can help cover greater distances to purchase food. ```{python}#| label: fig-car10#| fig-cap: "Low Food Access Population and Household Vehicle Availability"def plt_dis(c): f = sns.displot(data=df, x=c, y='PCT_LACCESS_POP10', height=4, aspect=10/8.27, bins=10) plt.xlim([-5, 100]) plt.ylabel("Low Food Access Pop. %", alpha=0.9) plt.xlabel("Household % with Low Food Access & No Vehicle", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_HHNV10')```To take account of this additional variable, we look at the density plot of household vehicle availability and county food accessibility. The blocks farther away from the axes represent the areas where both food accessibility and vehicle availability for the populations living there are low. These counties are likely to have the biggest challenges of food accessibility. In other counties, although food accessibility is bad, most households have access to vehicles (shown by the few dark areas). For most counties, 0-40% have low food accesibility, but only <10% of the households with low food access do not have vehicles. Additionally, the lower percentage of households having vehicles is seen for some of the counties where larger proportion of populations did not have good food access. This could mean that the problem of food accessibility in these counties may be lower than depicted earlier. ### Income and Food AccessIncome is another crucial factor that can impact access to food. Low-income groups may be more sensitive to food prices and the lack of food availability in close proximity to their homes. The scatter plot charts the population with low food accessibility and populations with both low income and low food access. Here, low income is defined as "annual family income of less than or equal to 200 percent of the Federal poverty threshold based on family size" @documentation.```{python}#| label: fig-income10#| fig-cap: "Low Food Access Population and Household Vehicle Availability"#| sns.set(color_codes=True)f, ax = plt.subplots(figsize=(6, 5))sns.scatterplot(data=df, x='PCT_LACCESS_POP10', y='PCT_LACCESS_LOWI10')ax.set_ylabel("County Population % with Low Access and Low Income", size =12, alpha=0.9)ax.set_xlabel("County Population % with Low Food Access",size =12, alpha=0.9)ax.tick_params(axis='both', which='major', labelsize=14)plt.xlim([0, 102])xpoints = ypoints = ax.get_xlim()ax.plot(xpoints, ypoints, linestyle='--', color='cornflowerblue', lw=1, scalex=False, scaley=False)f.show()```The plot shows that the data points always fall below the identity function line. So, the proportion of population with both low food access and low incom is less than the population percentage with low food access. This suggests that not whole of the population with low food access have low income (as defined for the datset). However, the plot also shows that counties with worse food access have greater percentages of people with low access and low income. Thus, income can be correlated with food accessibility. Furthermore, the variation in these percentage values increases as accessibility worsens. This is also reminiscent of the variation in household availability for counties with bad food access.### Demographic CharacteristicsIn counties where >80% of the population has low food accessibility, the majority of the population is white, followed by Hispanic and American Indian / Alaska Native.```{python}from IPython.display import Markdownfrom tabulate import tabulate# Create a separate dataframe containing county info and race categoriesdf_race = pd.DataFrame(df, columns=['FIPS', 'State', 'County', 'PCT_LACCESS_POP10','PCT_NHWHITE10','PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10','PCT_NHNA10','PCT_NHPI10', 'PCT_LACCESS_POP15', 'PCT_LACCESS_WHITE15','PCT_LACCESS_BLACK15', 'PCT_LACCESS_HISP15','PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15','PCT_LACCESS_MULTIR15'])# Store avg pct values for each race here, create table using this dictionarytable = []# all race categories as given by the datasetrace_catg = ["WHITE", "BLACK", "HISPANIC", "ASIAN", "AMER INDIAN / ALASKA NATIVE", "HAWAIIAN / PACIFIC ISLANDER", "MULTIRACIAL"]# 2010 very low access counties low_access10 = df_race[df['PCT_LACCESS_POP10']>80]# Update dictionary with 2010 mean values for very low access countiesfor (mean_pct,catg) inzip(low_access10.mean()[1:7],race_catg): table.append([catg, round(mean_pct, 2)])# 2015 low_access15 = df_race[df['PCT_LACCESS_POP15']>80]for (mean_pct,i) inzip(low_access15.mean()[8:],range(len(race_catg))): mean_pct =round(mean_pct, 2)try: table[i].append(mean_pct)exceptIndexError: table[-1].append(0) table[-1].append(mean_pct) # accounting for the extra MULTIRACIAL catg for 2015Markdown(tabulate(table, headers=["Race","2010", "2015"]))```# Features' RelationshipsNext, we investigate the dependence of some variables in the dataset. First, we look at how food accessibility variables affect food insecurity feature given in the dataset. Second, we look at the presence of food assistance program compared with food accessibility levels. ### Food Insecurity and Accessibility The dataset provides state-level food insecurity information. Food-insecure households are those that “were unable, at times during the year, to provide adequate food for one or more household members because the household lacked money and other resources for food” @documentation. The values measuring food insecurity were given as three-year averages. In the following model, the food insecurity percent average for 2012-2014 is used as the response variable for the training set whereas the food insecurity percent average for 2015-2017 is used as the response variable for the test set. The data frame used here was created by grouping the rows by state and taking the average. Thus, the data frame contains 51 rows with state-level information. ```{python}df.drop(labels=['LACCESS_POP10', 'LACCESS_POP15', 'PCH_LACCESS_POP_10_15', 'LACCESS_LOWI10', 'LACCESS_LOWI15','PCH_LACCESS_LOWI_10_15', 'LACCESS_HHNV10', 'LACCESS_HHNV15', 'PCH_LACCESS_HHNV_10_15','LACCESS_SNAP15', 'LACCESS_CHILD10', 'LACCESS_CHILD15', 'LACCESS_CHILD_10_15', 'LACCESS_SENIORS10', 'LACCESS_SENIORS15', 'PCH_LACCESS_SENIORS_10_15', 'LACCESS_WHITE15', 'LACCESS_BLACK15', 'LACCESS_HISP15', 'LACCESS_NHASIAN15', 'LACCESS_NHNA15', 'LACCESS_NHPI15', 'LACCESS_MULTIR15'], axis=1, inplace=True)df_avg = df.groupby(['State']).mean()df10_cols = ['PCT_LACCESS_POP10', 'PCT_LACCESS_LOWI10', 'PCT_LACCESS_HHNV10','PCT_LACCESS_CHILD10','PCT_LACCESS_SENIORS10', 'PCT_FREE_LUNCH10', 'PCT_REDUCED_LUNCH10','PCT_NHWHITE10', 'PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10', 'PCT_NHNA10', 'PCT_NHPI10','PCT_65OLDER10', 'PCT_18YOUNGER10','GROCPTH11', 'SUPERCPTH11', 'CONVSPTH11', 'SPECSPTH11', 'SNAPSPTH12', 'FFRPTH11', 'FSRPTH11', 'PC_FFRSALES07','PC_FSRSALES07', 'PCT_SNAP12', 'SNAP_PART_RATE11', 'PCT_NSLP12','PCT_SBP12', 'PCT_SFSP12','FDPIR12','FOODINSEC_12_14','VLFOODSEC_12_14','DIRSALES_FARMS07', 'FMRKTPTH13']df15_cols = ['PCT_LACCESS_POP15', 'PCT_LACCESS_LOWI15', 'PCT_LACCESS_HHNV15', 'PCT_LACCESS_SNAP15', 'PCT_LACCESS_CHILD15', 'PCT_LACCESS_SENIORS15', 'PCT_LACCESS_WHITE15','PCT_LACCESS_BLACK15','PCT_LACCESS_HISP15', 'PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15', 'PCT_LACCESS_MULTIR15', 'PCT_FREE_LUNCH15', 'PCT_REDUCED_LUNCH15', 'GROCPTH16', 'SUPERCPTH16', 'CONVSPTH16', 'SPECSPTH16', 'SNAPSPTH17', 'FFRPTH16', 'FSRPTH16','PC_FFRSALES12', 'PC_FSRSALES12', 'PCT_SNAP17','SNAP_PART_RATE16', 'PCT_NSLP17', 'PCT_SBP17', 'PCT_SFSP17','FDPIR15', 'FOODINSEC_15_17','VLFOODSEC_15_17', 'DIRSALES_FARMS12', 'FMRKTPTH18']# Foodhub, metro 13, medhincome 15 'POVRATE15' are not here``````{python}df10 = pd.DataFrame(df_avg, columns = df10_cols)df15 = pd.DataFrame(df_avg, columns = df15_cols)scaler = StandardScaler()df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)df10.rename(columns =lambda x : str(x)[:-2], inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)X_train = df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER', 'VLFOODSEC_12_', 'FOODINSEC_12_'], axis=1)y_train = df10['FOODINSEC_12_']X_test = df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR', 'VLFOODSEC_15_', 'FOODINSEC_15_'], axis=1)y_test = df15['FOODINSEC_15_']X_train = X_train.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP','PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)X_test = X_test.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP','PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)import statsmodels.api as smX = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)from sklearn.linear_model import LinearRegressionmodel = LinearRegression()model.fit(X_train,y_train)print('Test Score:', model.score(X_test,y_test))```The linear regression returns very low R2 scores suggesting that the food- accessibility variables may not entirely represent the level of food insecurity in states. ### Food Assistance ProgramsThe dataset gives several variables related to food assistance programs. Here, we look at some of these variables to see if these programs are available in areas with low food accessibility. The input variables here are Students eligible for free lunch (%), Students eligible for reduced-price lunch (%), SNAP participants (% pop), SNAP benefits per capita (% change), National School Lunch Program participants (%), School Breakfast Program participants (% children), Summer Food Service Program participants (% children), FDPIR (Food Distribution Program on Indian Reservations) sites. Regression models are created separately for 2010 and 2015 to see how the availability of food assistance programs and food accessibility are related for both years. Following the previous analysis, the same state-level data frame is used for the regression models. ```{python}# create 2010 and 2015 dataframes for statesdf10 = pd.DataFrame(df_avg, columns = df10_cols)df15 = pd.DataFrame(df_avg, columns = df15_cols)scaler = StandardScaler()df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)df10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# training and test datasets according to yearx10 = df10.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)y10 = df10['PCT_LACCESS_POP']x15 = df15.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)y15 = df15['PCT_LACCESS_POP']x10 = x10.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV','GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)x15 = x15.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV','GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)# 2010 modelX = sm.add_constant(x10)model = sm.OLS(y10, X).fit()model.summary()print("2010: ")print('R-squared:', model.rsquared)print('P-values:', model.pvalues)# 2015 modelX = sm.add_constant(x15)model = sm.OLS(y15, X).fit()model.summary()print("2015: ")print('R-squared:', model.rsquared)print('P-values:', model.pvalues)```For both years, the r2 scores are similar and less than 0.5. This suggests that the prevalence of food assistance programs is only partly correlated with food accessibility. Although food assistance programs exist to some extent in states with poor food accessibility, this may not be the case for all states.### Change in Food Accessibility (2010-15)Lastly, regression models are used to analyze what factors can affect food accessibility in U.S. counties. Note: The income and household vehicle availability variables are removed here since they are largely correlated with the food accessibility response variables.```{python}# create 2010 and 2015 dataframes for all countiesdf10 = pd.DataFrame(df, columns = df10_cols)df15 = pd.DataFrame(df, columns = df15_cols)# Rename columnsdf10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)# Drop columnsdf10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# Create dataframe with differences between 2010 & 2015 values as columnsdiff = pd.DataFrame()for c in df10.columns: diff[c] = df15[c]-df10[c]# Scale all columnsscaler = StandardScaler()diff = pd.DataFrame(scaler.fit_transform(diff), columns=diff.columns)# Drop correlating variables and create train/test dataX = diff.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = diff['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Apply linear regressionX = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)```The model used here does not produce good R2 scores. One reason for this may be that the counties for which there are significant changes are low in number and may seem like unusual values for the dataset. This is shown by the distribution plot shown in @fig-extremediff. ```{python}#| label: fig-extremediff#| fig-cap: "Distribution of Change in Low Access Population (Standardized)"# Plot distribution of difference in low food access pop% def plt_dis(c): f = sns.displot(data=diff, x=c, height=4, aspect=10/8.27, bins=20) plt.ylabel("Number of Counties", alpha=0.8) plt.xlabel("Change in Low Food Access Pop. % (standardized)", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_POP')```Now, we create a model using data for only the counties that have significant changes in food accessibility. For this, the dataset was filtered to contain only counties for which the standardized Change in Food Accessibility was greater than 1.5 and less than -1.5.```{python}# filter data for extreme casesdf_extreme = diff[diff['PCT_LACCESS_POP'] >1.5]df_extreme = diff[diff['PCT_LACCESS_POP'] <-1.5]# Drop correlated variables and create train/test dataX = df_extreme.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = df_extreme['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Linear regressionprint("Linear Regression")X = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)# Decision Treeprint("Decision Tree Regression")dt_model = DecisionTreeRegressor(random_state=42)dt_model.fit(X_train, y_train)dt_pred = dt_model.predict(X_test)dt_score = dt_model.score(X_test, y_test)print("Decision Tree score:", dt_score)```The R2 score for linear regression improved significantly for this data frame. The R2 further improves when using a Decision Tree Regrressor.# Conlusion